Business Case

For this data science challenge, you are provided with a dataset containing mobility traces of ~500 taxi cabs in San Francisco collected over ~30 days. The format of each mobility trace file is the following - each line contains [latitude, longitude, occupancy, time], e.g.: [37.75134 -122.39488 0 1213084687], where latitude and longitude are in decimal degrees, occupancy shows if a cab has a fare (1 = occupied, 0 = free) and time is in UNIX epoch format.

The goal of this data science challenge is twofold:

  1. To calculate the potential yearly reduction in CO2 emissions caused by the taxi cabs roaming without passengers. In your calculation, please assume that the taxi cab fleet is converting at a rate of 15% per month (from combustion-engine-powered vehicles to electric vehicles). Assume also that the average passenger vehicle emits about 404 grams of CO2 per mile.

  2. To build a predictor for taxi drivers, predicting the next place a passenger will hail a cab.

  3. (Bonus question) Identify clusters of taxi cabs that you find being relevant from the taxi cab company point of view.

0. IMPORTS & HELPER FUNCTIONS

0.1 Imports

0.2 Helper Function

1. BUSINESS UNDERSTANDING, ASSUMPTIONS & APPROACHES

First, an in-depth analysis of the business objectives and needs has to be done. The current situation must be assessed, and from these insights the goals of the process must be defined, followed by a plan for how to proceed. In our case, the aims are to estimate the yearly CO2 emission reduction, to build a predictor for taxi drivers that predicts the next place a passenger will hail a cab, and to identify clusters of taxi cabs that are relevant from the taxi cab company's point of view.

Case 1 - CO2 emission reduction

In order to find the yearly CO2 emission reduction, I estimate the vacant (unoccupied) miles driven by the fleet and apply the 404 g/mile emission factor and the 15%-per-month fleet-conversion assumption stated above.
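As a rough sketch of this arithmetic (the 404 g/mile factor and the 15% monthly conversion rate come from the challenge statement; the monthly vacant-miles figure below is a hypothetical placeholder, not a value derived from the dataset):

```python
# Sketch of the yearly CO2-reduction calculation.
GRAMS_PER_MILE = 404              # average CO2 per mile, combustion vehicle
MONTHLY_VACANT_MILES = 1_000_000  # hypothetical fleet-wide vacant miles/month
CONVERSION_RATE = 0.15            # share of combustion cabs replaced monthly

saved_grams = 0.0
combustion_share = 1.0
for month in range(12):
    combustion_share *= (1 - CONVERSION_RATE)  # fleet share still combustion
    electric_share = 1 - combustion_share
    # vacant miles now driven electrically no longer emit CO2
    saved_grams += MONTHLY_VACANT_MILES * electric_share * GRAMS_PER_MILE

saved_tonnes = saved_grams / 1e6  # grams -> metric tonnes
print(f"Potential yearly CO2 reduction: {saved_tonnes:,.0f} t")
```

With these placeholder inputs the electric share grows from 15% in month one to about 86% in month twelve, and the avoided emissions accumulate accordingly.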

Case 2 - Next place a passenger will hail a cab (Occupancy Model)

Based on the given dataset with latitude, longitude and time data, I divide the task of predicting the next place a passenger will hail a cab into two phases.

Case 3 - Bonus Question - Identify clusters of taxi cabs that you find being relevant.

I apply two different methods to identify clusters of taxi cabs.

First Approach:

Second Approach:

2. DATA UNDERSTANDING

We have mobility traces of ~500 taxi cabs in San Francisco collected over ~30 days. The format of each mobility trace file is the following - each line contains [latitude, longitude, occupancy, time].

2.1 Data Loading
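A minimal loader for a single trace file might look like this (the whitespace-separated column layout follows the format described above; the function name and the use of pandas are my own choices):

```python
import pandas as pd

def load_trace(path):
    """Load one cab's trace file: latitude, longitude,
    occupancy (1 = occupied, 0 = free), UNIX epoch time."""
    df = pd.read_csv(path, sep=" ", header=None,
                     names=["latitude", "longitude", "occupancy", "time"])
    # convert epoch seconds to timestamps and sort chronologically
    df["time"] = pd.to_datetime(df["time"], unit="s")
    return df.sort_values("time").reset_index(drop=True)
```

Sorting by time matters because later steps treat each trace as a time series.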

3. DATA PREPARATION

Most data used for data mining was originally collected and preserved for other purposes and needs some refinement before it is ready to use for modeling.

The data preparation phase includes five tasks. These are

Selecting data

Cleaning data

Constructing data

Integrating data

Formatting data

3.1 Data Dimension

3.2 Data Types and Structure

3.3 Missing Values Check

There are no missing values in the dataset.

4. EXPLORATORY DATA ANALYSIS

4.1 Descriptive Stats

Distribution of Occupancy and Vacancy

By inspecting the data, I found that some coordinates are placed in the Pacific Ocean, but we assume these are GPS errors.

The maximum time is Jun 10 2008 11:19:56 GMT+0200 (Central European Summer Time) and the minimum time is May 17 2008 12:00:13 GMT+0200 (Central European Summer Time).

The occupancy and vacancy rates seem to be roughly equally distributed.

4.2 All Variable Stats - Distribution , Cardinality, Correlation

5. MODELLING

After data preparation, our data is in good shape, and we can now search for useful patterns in it. The modeling phase includes four tasks. These are:

Selecting modeling techniques

Designing test(s)

Building model(s)

Assessing model(s)

5.1 Case 1: CO2 Reduction Analysis

The average vacant distance for each of the selected taxi drivers

5.2 Case 2. To build a predictor for taxi drivers, predicting the next place a passenger will hail a cab.

5.2.1 Case 2 - Phase 1 - Predicting Next Place [For Single Taxi Driver]

In order to restructure the dataset with a time-series mindset, I follow the steps below.

Since this is a time-series approach, the train/test split must be chronological: the first n% of the data becomes the training set and the last (1 - n)% the test set. I use an 80-20 split.
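The windowing and chronological split described above can be sketched as follows (the helper names and the number of lags are assumptions for illustration):

```python
import numpy as np

def make_supervised(coords, n_lags=3):
    """Turn a sequence of (lat, lon) points into lagged features X
    and next-point targets y for supervised learning."""
    X, y = [], []
    for i in range(n_lags, len(coords)):
        X.append(coords[i - n_lags:i].ravel())  # previous n_lags points
        y.append(coords[i])                     # the point to predict
    return np.array(X), np.array(y)

def chronological_split(X, y, train_frac=0.8):
    """First train_frac of samples for training, the rest for testing
    (no shuffling, preserving time order)."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]
```

Shuffled splitting would leak future positions into the training set, which is why the cut is made on the time axis.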

5.2.1.1 Model 1 : Dense Neural Network

Here, I create a neural network with one input layer, one hidden layer and one output layer.

I use MSE as the loss because this is a regression problem.

The neural network is trained for 50 epochs. Below, the training and testing losses are plotted. From the graph, we see that the network does not overfit. I exclude the first 15 epochs from the plot: their loss is so high that the two curves would otherwise be unreadable, whereas after epoch 15 the MSE is already small and the gap between the train and test curves is visible.
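A minimal stand-in for this network, using scikit-learn's MLPRegressor with a single hidden layer and squared-error loss (the notebook's actual framework, layer sizes and data are not shown here; the synthetic lag features below are placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 6))                    # 3 lagged (lat, lon) points
y = X[:, :2] + 0.01 * rng.normal(size=(500, 2))   # next point ~ f(lags) + noise

cut = int(0.8 * len(X))                           # chronological 80/20 split
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X[:cut], y[:cut])                       # one hidden layer, MSE loss

mse = mean_squared_error(y[cut:], model.predict(X[cut:]))
print(f"Test MSE: {mse:.5f}")
```

The two-column target shows the multi-output setup: one model predicts latitude and longitude jointly, in contrast to the per-coordinate GBM approach below.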

5.2.1.2 Model 2 : Gradient Boosting Regressor

In this model I predict latitude and longitude separately; combining the two predictions then gives the next coordinates. I use the Gradient Boosting modelling technique.

In order to understand gradient boosting, we first have to understand what boosting is.

Boosting: an ensemble approach that repeatedly resamples the data, but selects the samples intelligently, giving progressively more weight to observations that are hard to predict correctly.

So gradient boosting combines weak learners into a single strong learner in an iterative fashion.
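A sketch of the per-coordinate approach with scikit-learn's GradientBoostingRegressor, which is single-output and therefore needs one model per coordinate (all data and hyperparameters below are illustrative placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 6))                   # hypothetical lag features
lat = X[:, 0] + 0.01 * rng.normal(size=400)      # hypothetical next latitude
lon = X[:, 1] + 0.01 * rng.normal(size=400)      # hypothetical next longitude

cut = int(0.8 * len(X))                          # chronological split
lat_model = GradientBoostingRegressor(random_state=0).fit(X[:cut], lat[:cut])
lon_model = GradientBoostingRegressor(random_state=0).fit(X[:cut], lon[:cut])

# combine the two single-output predictions into a coordinate pair
next_point = (lat_model.predict(X[cut:])[0], lon_model.predict(X[cut:])[0])
print(next_point)
```

Training two independent regressors is the standard workaround when a boosting implementation does not support multi-output targets natively.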

5.2.1.2 (A) Predicting Latitude

5.2.1.2 (A) - i. Latitude Model Evaluation

5.2.1.2 (B) Predicting Longitude

5.2.1.2 (B) - i. Longitude Model Evaluations

5.2.1.3 Case 2 - Phase 1 - Model Evaluation (NN and GBM Models)

We evaluate not just the models we create but also the process used to create them, and their potential for practical use.

The evaluation phase includes three tasks. These are:

Evaluating results

Reviewing the process

Determining the next steps

I also plot latitude on the primary y-axis and longitude on the secondary y-axis, with time on the x-axis, so we can follow the latitude and longitude changes minute by minute. As a result, the predictions look good!

Note: The mean squared error tells you how close a regression line is to a set of points. It takes the distances from the points to the regression line (these distances are the "errors") and squares them. Squaring removes any negative signs and gives more weight to larger differences. It's called the mean squared error because you're finding the average of a set of squared errors.
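The definition in the note can be checked in a couple of lines (the numbers are made up for illustration):

```python
import numpy as np

# MSE by hand: square the errors (removing signs and weighting
# large errors more heavily), then average them.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.0 + 1.0) / 3
```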

5.2.2 Case 2 - Phase 2 - Modelling Occupancy

5.2.2 Case 2 - Phase 2 - Model 1: Baseline Model

As a baseline, I select a naive model that always predicts 0 (no pick-up) for the next location.

Naive model: for naive forecasts, we simply set all forecasts to the value of the last observation. This method works remarkably well for many economic and financial time series.

5.2.2 Case 2 - Phase 2 - Model 2 (A): Random Forest Model

I also build a Random Forest in order to predict whether a data point is a pick-up location or not. RF outperforms the baseline approach significantly.

A random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest outputs a class prediction, and the class with the most votes becomes the model's prediction.
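A minimal illustration of this voting behaviour (the synthetic pick-up data below stands in for the real trace-derived features and labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: points near a "hotspot" at (0.5, 0.5) are labelled
# as pick-up locations; the forest should recover that rule.
rng = np.random.default_rng(2)
X = rng.uniform(size=(1000, 2))                 # latitude, longitude
y = (np.hypot(X[:, 0] - 0.5, X[:, 1] - 0.5) < 0.3).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:800], y[:800])                       # majority vote over 100 trees

acc = clf.score(X[800:], y[800:])
print(f"Hold-out accuracy: {acc:.2f}")
print("Feature importances:", clf.feature_importances_)
```

The `feature_importances_` attribute is what the feature-importance analysis in the next subsection inspects.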

5.2.2 Case 2 - Phase 2 - Model 2 (A) Feature Importance

We can also inspect and interpret the trained Random Forest classifier by analyzing the importance of each feature. Coordinates are the most important features for classifying a pick-up point, whereas the day feature does not help the classifier.

5.2.2 Case 2 - Phase 2 - Model 2 (B): Random Forest after Adding a New Variable - Holiday Data

I expect that adding holiday and weather data would be very useful to our model, so I add holiday data as a flag and retrain the Random Forest classifier. There is only one holiday in the given period, Memorial Day (26th of May). The overall results of both models are the same, but the precision of predicting a hail increases by 1%. Therefore, we can use the holiday flag in our model, and this suggests that other data, such as weather, would also be valuable.

5.2.2.1 Case 2 - Phase 2 - Model Evaluation - RF on data without Holiday / RF on data with Holiday

I use accuracy as the metric, since the problem is fairly balanced (55% of the labels are 0 and 45% are 1). In a scenario with higher imbalance, I would have used the F1 score. The formulas and meanings of accuracy, F1 score, and related metrics are given below.

Accuracy: The proportion of the total number of predictions that were correct.

Positive Predictive Value (Precision): The proportion of predicted positive cases that are actually positive.

Negative Predictive Value: The proportion of predicted negative cases that are actually negative.

Sensitivity (Recall): The proportion of actual positive cases correctly identified.

Specificity: The proportion of actual negative cases correctly identified.

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

Kappa = (Observed Accuracy − Expected Accuracy) / (1 − Expected Accuracy)
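The formulas above, evaluated on a hypothetical confusion matrix (the counts are made up for illustration):

```python
# tp/fp/tn/fn: true/false positives and negatives from a confusion matrix
tp, fp, tn, fn = 40, 10, 45, 5

accuracy    = (tp + tn) / (tp + fp + tn + fn)
precision   = tp / (tp + fp)          # positive predictive value
npv         = tn / (tn + fn)          # negative predictive value
recall      = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

# expected accuracy: chance agreement given the marginal totals
expected_acc = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (tp + fp + tn + fn) ** 2
kappa = (accuracy - expected_acc) / (1 - expected_acc)

print(accuracy, precision, recall, f1, kappa)
```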

Hyperparameter Fine-Tuning [Not Run]

5.3 Case 3 - Bonus Question - Clustering Taxi Cabs

I apply two different methods to identify clusters of taxi cabs.

First Approach:

Second Approach:

5.3.1 DBSCAN
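A sketch of DBSCAN on synthetic pick-up coordinates (the eps and min_samples values are illustrative; on real GPS data one would tune them, possibly using a haversine distance metric):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic "hotspots" plus scattered noise stand in for
# real pick-up coordinates; eps is in the same degree-like units.
rng = np.random.default_rng(3)
hotspot_a = rng.normal([37.75, -122.40], 0.002, size=(50, 2))
hotspot_b = rng.normal([37.78, -122.42], 0.002, size=(50, 2))
noise = rng.uniform([37.70, -122.50], [37.82, -122.35], size=(10, 2))
points = np.vstack([hotspot_a, hotspot_b, noise])

labels = DBSCAN(eps=0.01, min_samples=5).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
print(f"Found {n_clusters} clusters")
```

Unlike k-means, DBSCAN does not need the number of clusters in advance and labels sparse outliers as noise, which suits GPS hotspot detection.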

5.3.2 Businesswise Segmentation (RFM Segmentation)

Considering the data at hand, total miles per cab, total occupied miles per cab and average active minutes per day are more informative variables than the others. I extract these below and create a new dataset for RFM-style segmentation.

After merging the data, I compute the quartile values. These help me divide the data into four groups on every variable.
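The quartile scoring can be sketched with pandas qcut (the per-cab numbers below are made up for illustration; the column names stand in for the real extracted variables):

```python
import pandas as pd

# Hypothetical per-cab features; quartiles split each variable into
# four bins, giving a simple RFM-style score per cab.
cabs = pd.DataFrame({
    "total_miles": [120, 340, 80, 510, 260, 430, 150, 390],
    "occupied_miles": [60, 200, 30, 400, 150, 300, 70, 250],
    "active_minutes": [300, 480, 200, 600, 420, 550, 280, 500],
})

for col in list(cabs.columns):
    # qcut assigns quartile labels 1 (lowest) .. 4 (highest)
    cabs[col + "_score"] = pd.qcut(cabs[col], 4, labels=[1, 2, 3, 4]).astype(int)

score_cols = [c for c in cabs.columns if c.endswith("_score")]
cabs["rfm_score"] = cabs[score_cols].sum(axis=1)
print(cabs[["rfm_score"]])
```

Cabs with a high combined score are the heavily used, high-revenue vehicles; low scorers are candidates for reassignment or review.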

We can then give the clusters names accordingly. Here are some examples.

6. DEPLOYMENT STRATEGY
